Project Scope

In this project, I went over a basic Python data analysis pipeline from start to finish. In the second part of this notebook, I analyzed factors that correlated with the SAT scores, and did some exploratory visualization and analysis.

Step1: Computing correlations

To explore a dataset and see what columns are related to the one cared about (SAT score in this case) is to compute correlations. This will tell which columns are closely related to the SAT score. We can do this via the corr method on Pandas DataFrames. The closer to 0 the correlation, the weaker the connection. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation:

In [1]:
import pandas as pd

full=pd.read_csv('full.csv')
full.corr()['sat_score']
Out[1]:
Unnamed: 0                          -0.025372
Grade                                     NaN
Year                                      NaN
Number Tested                        0.055520
Mean Scale Score                     0.676679
Level 1 #                           -0.509504
Level 1 %                           -0.574281
Level 2 #                           -0.390404
Level 2 %                           -0.656644
Level 3 #                            0.382374
Level 3 %                            0.427424
Level 4 #                            0.628875
Level 4 %                            0.616284
Level 3+4 #                          0.545082
Level 3+4 %                          0.674423
SAT Critical Reading Avg. Score      0.987692
SAT Math Avg. Score                  0.972780
SAT Writing Avg. Score               0.987845
sat_score                            1.000000
schoolyear                                NaN
fl_percent                                NaN
frl_percent                         -0.685127
total_enrollment                     0.376035
ell_num                             -0.132403
ell_percent                         -0.388161
sped_num                             0.041456
sped_percent                        -0.452665
asian_num                            0.482291
asian_per                            0.563766
black_num                            0.044814
                                       ...   
rr_s                                 0.212396
rr_t                                -0.034143
rr_p                                -0.000027
N_s                                  0.419905
N_t                                  0.287289
N_p                                  0.391698
saf_p_11                             0.096052
com_p_11                            -0.110301
eng_p_11                             0.004024
aca_p_11                             0.018074
saf_t_11                             0.268028
com_t_11                             0.071342
eng_t_10                                  NaN
aca_t_11                             0.105755
saf_s_11                             0.286231
com_s_11                             0.141411
eng_s_11                             0.169146
aca_s_11                             0.285307
saf_tot_11                           0.263167
com_tot_11                           0.056865
eng_tot_11                           0.076273
aca_tot_11                           0.155294
CSD                                  0.020645
NUMBER OF STUDENTS / SEATS FILLED    0.395938
NUMBER OF SECTIONS                   0.369681
AVERAGE CLASS SIZE                   0.328050
SIZE OF SMALLEST CLASS               0.205278
SIZE OF LARGEST CLASS                0.266764
SCHOOLWIDE PUPIL-TEACHER RATIO            NaN
school_dist                          0.020645
Name: sat_score, Length: 88, dtype: float64

This gives us quite a few insights that we'll need to explore:

  • Total enrollment correlates strongly with sat_score, which is surprising, because you'd think smaller schools, which focused more on the student, would have higher scores.
  • The percentage of females at a school (female_per) correlates positively with SAT score, whereas the percentage of males (male_per) correlates negatively.
  • None of the survey responses correlate highly with SAT scores.
  • There is a significant racial inequality in SAT scores (white_per, asian_per, black_per, hispanic_per).
  • ell_percent correlates strongly negatively with SAT scores.
  • Each of these items is a potential angle to explore and tell a story about using the data.

Step2: Setting the context

Before we dive into exploring the data, we'll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we'll map out the positions of the schools, which will help readers understand the problem we're exploring. In the below code, we:

  • Setup a map centered on New York City.
  • Add a marker to the map for each high school in the city.
  • Display the map.
In [2]:
import folium
from folium import plugins
import pandas as pd

full=pd.read_csv('full.csv')

schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
marker_cluster = folium.plugins.MarkerCluster().add_to(schools_map)
for name, row in full.iterrows():
    folium.Marker(location = [row["lat"], row["lon"]], popup="{0}: {1}").add_to(marker_cluster)
    
schools_map.save("schools.html")
schools_map
C:\Users\User\Anaconda2\lib\site-packages\folium\__init__.py:59: UserWarning: This version of folium is the last to support Python 2. Transition to Python 3 to be able to receive updates and fixes. Check out https://python3statement.org/ for more info.
  UserWarning
Out[2]:

This map is helpful, but it's hard to see where the most schools are in NYC. Instead, we'll make a heatmap:

In [3]:
schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
schools_heatmap.add_child(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Out[3]:

Step3: District level mapping

Heatmaps are good for mapping out gradients, but we'll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area. We can compute SAT score by school district, then plot this out on a map. In the below code, we'll:

  • Group full by school district.
  • Compute the average of each column for each school district.
  • Convert the school_dist field to remove leading 0s, so we can match our geograpghic district data.
In [4]:
import numpy as np

district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
district_data
Out[4]:
school_dist Unnamed: 0 Grade Year Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # Level 2 % ... com_tot_11 eng_tot_11 aca_tot_11 CSD NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS SCHOOLWIDE PUPIL-TEACHER RATIO
0 1 129.857143 8.0 2011.0 67.501931 661.500000 8.500000 17.100000 24.500000 47.300000 ... 6.842857 7.200000 7.728571 1.0 117.511870 5.202949 22.704137 18.365008 26.843786 NaN
1 2 137.019608 8.0 2011.0 72.028087 674.000000 3.000000 9.400000 10.500000 33.200000 ... 6.674510 6.990196 7.490196 2.0 164.399306 6.279080 25.113515 20.289521 29.108830 NaN
2 3 137.785714 8.0 2011.0 68.030888 686.000000 6.000000 10.900000 16.000000 29.100000 ... 6.628571 6.935714 7.435714 3.0 146.892356 5.625330 22.818868 18.948012 26.120322 NaN
3 4 76.428571 8.0 2011.0 67.544402 679.666667 2.666667 3.266667 27.333333 40.833333 ... 7.271429 7.500000 8.071429 4.0 129.413285 4.978325 24.787783 21.016327 27.582740 NaN
4 5 126.000000 8.0 2011.0 88.015444 677.750000 6.000000 9.100000 28.000000 33.350000 ... 6.657143 6.900000 7.528571 5.0 114.864286 4.612271 24.049038 20.055357 27.602473 NaN
5 6 98.700000 8.0 2011.0 76.791892 672.333333 8.666667 10.033333 29.666667 34.900000 ... 6.920000 7.170000 7.760000 6.0 139.041709 5.285860 25.324199 20.556126 29.217872 NaN
6 7 124.066667 8.0 2011.0 79.715315 668.250000 10.250000 11.725000 40.500000 45.400000 ... 6.733333 7.013333 7.560000 7.0 96.776584 4.064317 23.927326 19.879125 27.230962 NaN
7 8 192.500000 8.0 2011.0 72.102317 667.000000 5.500000 8.800000 28.000000 47.700000 ... 6.450000 6.685714 7.235714 8.0 130.245547 5.258212 23.330603 18.631070 27.202677 NaN
8 9 137.136364 8.0 2011.0 71.024570 669.857143 5.714286 9.385714 25.714286 41.628571 ... 6.809091 7.090909 7.654545 9.0 122.537425 5.015612 24.334560 19.949618 28.586596 NaN
9 10 169.652174 8.0 2011.0 85.854289 668.400000 11.200000 10.120000 57.800000 47.720000 ... 6.569565 6.904348 7.452174 10.0 170.895481 6.593661 24.285629 18.880024 28.639997 NaN
10 11 192.562500 8.0 2011.0 74.721284 651.000000 26.000000 28.900000 45.000000 50.000000 ... 6.762500 6.975000 7.537500 11.0 132.926151 5.152383 25.102623 19.900881 29.630533 NaN
11 12 164.076923 8.0 2011.0 74.178794 670.500000 8.500000 10.500000 30.500000 40.300000 ... 6.976923 7.184615 7.769231 12.0 80.646589 3.448898 22.833851 19.409931 25.938405 NaN
12 13 232.928571 8.0 2011.0 74.723938 677.000000 2.000000 2.300000 32.000000 36.400000 ... 6.828571 7.071429 7.564286 13.0 207.911745 7.671935 24.529921 19.616295 28.614042 NaN
13 14 195.733333 8.0 2011.0 76.495495 661.666667 13.666667 15.133333 46.666667 51.400000 ... 6.966667 7.293333 7.806667 14.0 118.200813 4.711452 24.090538 19.651282 28.010821 NaN
14 15 133.200000 8.0 2011.0 56.381081 664.500000 5.333333 12.550000 21.666667 48.350000 ... 6.480000 6.720000 7.380000 15.0 129.795411 5.065871 25.097750 20.528505 29.093608 NaN
15 16 186.250000 8.0 2011.0 70.777027 657.000000 12.000000 19.400000 36.000000 58.100000 ... 5.975000 6.375000 6.825000 16.0 177.501282 6.970513 24.234006 18.070192 29.308333 NaN
16 17 217.285714 8.0 2011.0 68.837838 675.666667 5.000000 11.033333 15.000000 31.466667 ... 6.700000 7.014286 7.535714 17.0 131.510861 5.145898 25.536596 20.421458 29.678081 NaN
17 18 259.090909 NaN 2011.0 73.702703 NaN NaN NaN NaN NaN ... 6.618182 6.909091 7.436364 18.0 70.205019 3.304066 21.596982 18.344536 24.869170 NaN
18 19 251.307692 8.0 2011.0 71.879418 708.000000 0.000000 0.000000 1.000000 2.000000 ... 6.607692 6.846154 7.338462 19.0 101.694730 4.129952 23.947397 19.707307 27.703469 NaN
19 20 245.200000 8.0 2011.0 70.362162 680.000000 0.000000 0.000000 21.000000 36.800000 ... 7.080000 7.380000 7.800000 20.0 420.029766 14.721019 25.423366 17.681562 29.745831 NaN
20 21 261.090909 8.0 2011.0 74.457002 688.000000 0.000000 0.000000 17.000000 20.700000 ... 6.609091 6.909091 7.390909 21.0 224.702989 8.191627 25.284294 20.167110 28.885220 NaN
21 22 273.600000 8.0 2011.0 64.962162 688.000000 0.000000 0.000000 8.000000 26.700000 ... 6.780000 7.200000 7.700000 22.0 469.723495 15.381913 29.290109 21.457740 33.121792 NaN
22 23 131.000000 8.0 2011.0 67.900901 662.500000 10.500000 13.700000 31.500000 47.050000 ... 6.700000 7.000000 7.566667 23.0 120.113095 4.985119 24.299702 18.654762 28.994048 NaN
23 24 265.250000 8.0 2011.0 71.560811 700.000000 0.000000 0.000000 4.000000 8.300000 ... 6.791667 7.175000 7.716667 24.0 221.589007 8.753546 24.210875 18.756571 28.611316 NaN
24 25 205.375000 8.0 2011.0 75.314189 690.000000 2.666667 3.466667 18.000000 23.333333 ... 6.837500 7.112500 7.612500 25.0 280.576007 9.990385 27.106270 21.213189 31.371220 NaN
25 26 297.000000 NaN 2011.0 73.702703 NaN NaN NaN NaN NaN ... 6.540000 6.900000 7.300000 26.0 595.953216 19.240117 29.971140 20.315789 33.554327 NaN
26 27 231.300000 8.0 2011.0 85.191892 683.666667 2.000000 2.633333 28.000000 35.800000 ... 6.710000 6.950000 7.510000 27.0 249.324536 9.019043 26.649659 20.757535 30.938081 NaN
27 28 255.500000 8.0 2011.0 76.277027 684.333333 2.666667 3.400000 22.333333 27.433333 ... 6.841667 7.058333 7.666667 28.0 255.555540 8.799487 26.661398 20.930314 30.280509 NaN
28 29 269.555556 8.0 2011.0 73.213213 660.500000 11.000000 17.000000 40.000000 55.500000 ... 6.800000 7.044444 7.511111 29.0 87.819694 3.366690 26.056858 22.126698 29.368352 NaN
29 30 267.888889 8.0 2011.0 74.657658 697.500000 0.500000 0.600000 8.000000 9.400000 ... 6.744444 7.077778 7.588889 30.0 251.803744 9.260486 25.715644 19.312273 29.760726 NaN
30 31 341.900000 NaN 2011.0 73.702703 NaN NaN NaN NaN NaN ... 6.870000 7.200000 7.610000 31.0 380.528319 13.251284 28.119729 19.995494 32.785641 NaN
31 32 293.833333 8.0 2011.0 67.418919 685.000000 0.000000 0.000000 8.000000 22.200000 ... 6.750000 7.200000 7.733333 32.0 100.525613 4.450572 22.479804 17.936520 26.816013 NaN

32 rows × 88 columns

Store the district_data in csv format.

In [5]:
district_data.to_csv('district_data.csv')

We'll now we able to plot the average SAT score in each school district. In order to do this, we'll read in data in GeoJSON format to get the shapes of each district, then match each district shape with the SAT score using the school_dist column, then finally create the plot:

In [6]:
import json
import pandas as pd

def show_district_map(col):
    full=pd.read_csv('full.csv')
    district_data=pd.read_csv('district_data.csv')
    geo_data =r'schoolDistricts.json'
    geo_json_data=json.load(open(geo_data))
    #Create a map:
    districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
    #Create a layer, shaded by col:
    folium.Choropleth(
        geo_json_data,
        data=district_data,
        columns=['school_dist', col],
        key_on='feature.properties.SchoolDist',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
    ).add_to(districts)
    #Output the map to an .html file:
    districts.save(outfile='districts_1.html')
    return districts

Step4: Exploring enrollment and SAT scores

Now that we've set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we've set the stage, we can move into exploring the angles we identified earlier, when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score. We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools.

In [7]:
%matplotlib inline

full.plot.scatter(x='total_enrollment', y='sat_score')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xc656ba8>

As you can see, there's a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns. We can explore this further by getting the names of the schools with low enrollment and low SAT scores:

In [8]:
full[(full["total_enrollment"] < 1000) & (full["sat_score"] < 1000)]["School Name"]
Out[8]:
32     INTERNATIONAL SCHOOL FOR LIBERAL ARTS
139      INTERNATIONAL COMMUNITY HIGH SCHOOL
165                                      NaN
172    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
204            HIGH SCHOOL OF WORLD CULTURES
211       BROOKLYN INTERNATIONAL HIGH SCHOOL
235    INTERNATIONAL HIGH SCHOOL AT PROSPECT
243               IT TAKES A VILLAGE ACADEMY
257                MULTICULTURAL HIGH SCHOOL
Name: School Name, dtype: object

Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it's not total enrollment that's correlated to SAT score -- it's whether or not students in the school are learning English as a second language or not.

Step5: Exploring English language learners and SAT scores

Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The ell_percent column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:

In [9]:
full.plot.scatter(x='ell_percent', y='sat_score')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0xc8128d0>

It looks like there are a group of schools with a high ell_percentage that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:

In [10]:
show_district_map("ell_percent")
Out[10]:
In [11]:
show_district_map("sat_score")
Out[11]:

As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.

Step6: Correlating survey scores and SAT scores

It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory, lets plot out SAT scores and the various survey metrics:

In [12]:
full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xd745a58>

Surprisingly, the two factors that correlate the most are N_p and N_s, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the ell_learners. The other metric that correlates most is saf_t_11. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores.

Step7: Exploring race and SAT scores

One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what's happening:

In [13]:
full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xc812908>

It looks like the higher percentages of white and asian students correlate with higher SAT scores, but higher percentages of black and hispanic students correlate with lower SAT scores. For hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:

In [14]:
show_district_map("hispanic_per")
Out[14]:
In [15]:
show_district_map("ell_percent")
Out[15]:

Step8: Gender differences in SAT scores

The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:

In [16]:
full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ab3ba8>

To dig more into the correlation, we can make a scatterplot of female_per and sat_score:

In [17]:
full.plot.scatter(x='female_per', y='sat_score')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0xd6db160>

It looks like there's a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:

In [18]:
full[(full["female_per"] > 65) & (full["sat_score"] > 1400)]["School Name"]
Out[18]:
2             PROFESSIONAL PERFORMING ARTS HIGH SCH
76                   BARD HIGH SCHOOL EARLY COLLEGE
92                    ELEANOR ROOSEVELT HIGH SCHOOL
105                    TALENT UNLIMITED HIGH SCHOOL
123            FIORELLO H. LAGUARDIA HIGH SCHOOL OF
293                     TOWNSEND HARRIS HIGH SCHOOL
324    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
Name: School Name, dtype: object

Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.

Step9: AP score

So far, we've looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated, since students who are high academic achievers tend to do better on the SAT.

In [19]:
full["ap_avg"] = full["AP Test Takers "] / full["total_enrollment"]

full.plot.scatter(x='ap_avg', y='sat_score')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x12883860>

It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:

In [20]:
full[(full["ap_avg"] > .3) & (full["sat_score"] > 1700)]["School Name"]
Out[20]:
92             ELEANOR ROOSEVELT HIGH SCHOOL
102                   STUYVESANT HIGH SCHOOL
181             BRONX HIGH SCHOOL OF SCIENCE
184    HIGH SCHOOL OF AMERICAN STUDIES AT LE
210           BROOKLYN TECHNICAL HIGH SCHOOL
293              TOWNSEND HARRIS HIGH SCHOOL
313    QUEENS HIGH SCHOOL FOR THE SCIENCES A
Name: School Name, dtype: object

Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.